Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.8 - Case Statements")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
| | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
| 4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None |
Case Statements
Case statements are typically used to compute a value conditionally, where the result depends on which condition a row satisfies first. For example:
- if `x`, then `a`
- if `y`, then `b`
- for everything else, `c`
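The pattern above can be sketched in plain Python, without Spark, to make the evaluation order explicit. The helper `case_when`, its arguments, and the example predicates are hypothetical names chosen for illustration only. The point of the sketch is that branches are checked from top to bottom and the first match wins.

```python
from typing import Any, Callable, Sequence, Tuple

# A branch pairs a condition with the value to return when it holds.
Branch = Tuple[Callable[[Any], bool], Any]


def case_when(value: Any, branches: Sequence[Branch], default: Any) -> Any:
    """Return the result of the first branch whose predicate holds, else default."""
    for predicate, result in branches:
        if predicate(value):
            return result
    return default


branches = [
    (lambda v: v < 0, "a"),   # if x, then a
    (lambda v: v < 10, "b"),  # if y, then b
]

# "c" plays the role of the "everything else" value.
results = [case_when(v, branches, "c") for v in (-3, 4, 25)]
print(results)
```

Note that `-3` also satisfies the second predicate, but it is assigned `a` because the first branch is tested first.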
Using Switch/Case Statements in Spark
(
    pets
    .withColumn(
        'oldness_value',
        F.when(F.col('age') <= 5, 'young')
        .when((F.col('age') > 5) & (F.col('age') <= 10), 'middle age')
        .otherwise('old')
    )
    .toPandas()
)
| | id | breed_id | nickname | birthday | age | color | oldness_value |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | young |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | middle age |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | old |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | old |
| 4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None | old |
What Happened?
Based on each pet's age, we classified it as `young`, `middle age`, or `old`. Please don't take offense; this is merely an example.
We mapped the following logic:
- If the age is less than or equal to 5, the pet is considered `young`.
- If the age is greater than 5 but less than or equal to 10, the pet is considered `middle age`.
- Any older pet is considered `old`.
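As a rough cross-check of the rules above, the same bucketing can be written as an ordinary Python function and applied to the five ages from the table. `classify_age` is a hypothetical helper, not part of PySpark, and it assumes the ages are integers (the CSV reader returns string columns unless a schema is supplied or inferred). The sketch also shows that, because branches are evaluated in order, the explicit `age > 5` check in the second condition is not strictly needed.

```python
from typing import Optional


def classify_age(age: Optional[int]) -> str:
    """Bucket an age using the same thresholds as the Spark example."""
    # None stands in for a missing age: no comparison matches, so the
    # default applies, which mirrors how a SQL CASE expression treats
    # a NULL condition as "not matched".
    if age is not None and age <= 5:
        return "young"
    # Reaching this line already implies age > 5 (or age is None).
    if age is not None and age <= 10:
        return "middle age"
    return "old"


ages = [5, 10, 15, 17, 13]  # the age column from the table above
labels = [classify_age(a) for a in ages]
print(labels)
```

The resulting labels line up with the `oldness_value` column shown earlier, including the boundary cases at 5 and 10.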
Summary
- We learned how to map values with case statements, and how to supply a default value for rows where none of the conditions is satisfied.